Developing a tagset for automated part-of-speech tagging in Urdu

Author

  • Andrew Hardie
Abstract

While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way.

Similar resources

Automated part-of-speech analysis of Urdu: conceptual and technical issues

Part-of-speech (POS) tagging is the process of labelling tokens in a text with tags that indicate their morphosyntactic category, and has a wide range of applications in computational and corpus linguistics, such as the production of corpus-based dictionaries and grammars. This paper describes an experiment in extending POS tagging to a hitherto untagged language, Urdu. The most challenging tas...


Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank

This work aims at the development of a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language and the development of a reliable treebank for Urdu will have significant impact on the state-of-the-art for Urdu language processing. In the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed...


POS Tagging with a More Informative Tagset

We investigate the impact of introducing finer distinctions into the tagset on the accuracy of part-of-speech tagging. This is a tangential approach to most recent research in the field, which has focussed on applying different algorithms using a very similar set of features. We outline the basic approach to tagset refinement and describe preliminary find-


The CLE Urdu POS Tagset

The paper presents a design schema and details of a new Urdu POS tagset. This tagset is designed due to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Commo...


Publication date: 2004